Today we will…
You will be completing a final project in Stat 331/531 in teams of four.
When parsing dates and times, we have to consider complicating factors like…
lubridate
Common Tasks
Convert a date-like variable (“May 8, 1995”) to a date or date-time object.
Find the weekday, month, year, etc. from a date-time object.
Convert between time zones.
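As a minimal sketch of the second task, assuming the lubridate package is installed, the accessor functions below pull individual components out of a date. The object name `birthday` is made up for illustration.

```r
library(lubridate)

# Hypothetical example date: parse "May 8, 1995" into a Date object
birthday <- mdy("May 8, 1995")

year(birthday)                # 1995
month(birthday)               # 5
day(birthday)                 # 8
wday(birthday)                # 2 (weeks start on Sunday by default, so 2 is Monday)
wday(birthday, label = TRUE)  # the weekday as a labelled, ordered factor
```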
Note
The lubridate package loads with tidyverse.
date-time Objects
There are multiple data types for dates and times.
date or Date: a calendar date.
dtm: a date-time (a date plus a time).
POSIXct: stores date-times as the number of seconds since January 1, 1970 (“Unix Epoch”).
POSIXlt: stores date-times as a list with elements for second, minute, hour, day, month, year, etc.
date-time Objects
Create a date from individual components:
Create a date from a string:
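A short sketch of both approaches, assuming lubridate is loaded. The object names are illustrative only; `make_date()`, `mdy()`, `ymd()`, and `ymd_hms()` are lubridate functions.

```r
library(lubridate)

# From individual components
from_parts <- make_date(year = 1995, month = 5, day = 8)

# From strings: the function name gives the order of the components
from_mdy <- mdy("May 8, 1995")
from_ymd <- ymd("1995-05-08")

# A date-time from a string (the time zone defaults to UTC)
from_dttm <- ymd_hms("1995-05-08 14:30:00")

from_parts == from_mdy  # TRUE: both describe the same calendar date
```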
date-time Objects
What’s wrong here?
Make sure you use quotes!
date-time Components
date-time Objects
Doing subtraction gives you a difftime object.
difftime objects do not always have the same units; the units depend on the scale of the objects you are working with.
How old am I?
How long did it take me to finish a typing challenge?
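Both questions can be sketched with plain subtraction. The dates and times below are invented for illustration, and "today" is fixed so the result is reproducible.

```r
library(lubridate)

# Subtracting two dates: the result is reported in days
birthday  <- ymd("1995-05-08")
fixed_day <- ymd("2024-05-08")   # hypothetical stand-in for today()
age <- fixed_day - birthday
units(age)                       # "days"

# Subtracting two date-times 90 seconds apart: the result is reported in minutes
start  <- ymd_hms("2024-05-08 10:00:00")
finish <- ymd_hms("2024-05-08 10:01:30")
typing_time <- finish - start
units(typing_time)               # "mins"

# Converting the difftime to a lubridate duration expresses it in seconds
as.duration(typing_time)
```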
Durations will always give the time span in an exact number of seconds.
We can also add time:
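The sketch below contrasts adding a period with `days()` and adding a duration with `ddays()`. The date is chosen (as an assumption for illustration) to fall just before the 2024 daylight saving change in the US, where the two give different answers.

```r
library(lubridate)

# Noon on the day before US daylight saving time began in 2024
before_dst <- ymd_hms("2024-03-09 12:00:00", tz = "America/Los_Angeles")

# Period: one calendar day later, so the clock time stays at noon
plus_period <- before_dst + days(1)

# Duration: exactly 86,400 seconds later, which lands at 1 pm after the clock change
plus_duration <- before_dst + ddays(1)

hour(plus_period)    # 12
hour(plus_duration)  # 13
```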
days(), years(), etc. will add a period of time.
ddays(), dyears(), etc. will add a duration of time.
Time zones are complicated!
Specify time zones in the form:
You can change the time zone of a date in two ways:
with_tz() keeps the instant in time the same, but changes the visual representation.
force_tz() keeps the visual representation the same, but changes the instant in time.
When you read data in or create a new date-time object, the default time zone (if not specified) is UTC.
Make sure you specify your desired time zone!
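A minimal sketch of the UTC default and of lubridate's two time zone functions, `with_tz()` and `force_tz()`. The time stamp is made up for illustration.

```r
library(lubridate)

# No time zone given, so the date-time is created in UTC
x <- ymd_hms("2024-06-01 18:00:00")
tz(x)  # "UTC"

# Same instant, displayed on a Los Angeles clock (11:00 PDT)
same_instant <- with_tz(x, tzone = "America/Los_Angeles")

# Same clock reading (18:00), relabelled as Los Angeles time: a different instant
same_clock <- force_tz(x, tzone = "America/Los_Angeles")

same_instant == x                         # TRUE
difftime(same_clock, x, units = "hours")  # 7 hours later than x
```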
One of the most famous mysteries in California history is the identity of the so-called “Zodiac Killer”, who murdered 7 people in Northern California between 1968 and 1969. A new murder was committed last year in California, suspected to be the work of a new Zodiac Killer on the loose.
Unfortunately, the date and time of the murder is not known. You have been hired to crack the case. Use the clues below to discover the murderer’s identity.
Submit the name of the killer to the Canvas Quiz.
Today we will…
.qmd template for the Short Answer.
.qmd for the Open-Ended Analysis. You are encouraged to create this ahead of time.
Caution
While the coding tasks are open-resource, you will likely run out of time if you have to look everything up. Know what functions you might need and where to find documentation for implementing these functions.
stringr
A string is a bunch of characters.
Don’t confuse a string (many characters, one object) with a character vector (vector of strings).
stringr
Common tasks
Find which strings contain a particular pattern
Remove or replace a pattern
Edit a string (for example, make it lowercase)
Note
The package stringr is very useful for strings!
stringr loads with the tidyverse.
All of its functions are named str_xxx().
pattern =
The pattern argument in all of the stringr functions …
Note
Discuss with a neighbor. For each of these functions, give:
str_detect()
Returns a logical vector (TRUE/FALSE) indicating whether the pattern was found in each element of the original vector.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE TRUE TRUE
Useful inside filter(), or inside summarise() together with sum or mean.
Related functions
str_subset() returns just the strings that contain the match
str_which() returns the indexes of strings that have a match
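A sketch of how `str_detect()` might be combined with dplyr verbs. The `spies` tibble is toy data invented for this example, and the native pipe `|>` assumes R 4.1 or later.

```r
library(stringr)
library(dplyr)

# Hypothetical toy data
spies <- tibble(name = c("James Bond", "Jason Bourne", "Jimmy Bond", "Ethan Hunt"))

# Keep only the rows where the pattern is found
bonds <- spies |>
  filter(str_detect(name, pattern = "Bond"))

# TRUE counts as 1, so sum() gives the number of matches and mean() the proportion
bond_summary <- spies |>
  summarise(n_bond    = sum(str_detect(name, "Bond")),
            prop_bond = mean(str_detect(name, "Bond")))

str_subset(spies$name, "Bond")  # the matching strings themselves
str_which(spies$name, "Bond")   # their positions: 1 3
```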
str_match()
Returns a character matrix with either NA or the pattern, depending on whether the pattern was found.
str_extract()
Returns a character vector with either NA or the pattern, depending on whether the pattern was found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_extract(my_vector, pattern = "Bond")
[1] NA NA "Bond" "Bond"
Warning
str_extract() only returns the first pattern match.
Use str_extract_all() to return every pattern match.
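The difference shows up when an element contains the pattern more than once. The vector `quotes` is made up for this sketch.

```r
library(stringr)

quotes <- c("Bond", "Bond, James Bond", "Moneypenny")

# One result per element: the first match, or NA if there is none
first_only <- str_extract(quotes, pattern = "Bond")

# A list with one character vector per element, holding every match
every_match <- str_extract_all(quotes, pattern = "Bond")

first_only            # "Bond" "Bond" NA
lengths(every_match)  # 1 2 0
```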
str_locate()
Returns an integer matrix with two columns, start and end, giving either NA or the start and end position of the pattern.
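For instance, reusing the same example vector as the other slides (a sketch, assuming stringr is loaded):

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# One row per element; NA where the pattern is not found
bond_location <- str_locate(my_vector, pattern = "Bond")
bond_location
# Rows 1 and 2 are NA; "Bond" spans positions 1 to 4; in "James Bond" it spans 7 to 10
```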
str_subset()
Returns a character vector containing only the elements of the original vector where the pattern occurs.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_subset(my_vector, pattern = "Bond")
[1] "Bond" "James Bond"
Related Functions
str_sub() extracts values based on location.
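A brief sketch of `str_sub()`, which works from character positions rather than from a pattern. Negative positions count from the end of the string.

```r
library(stringr)

agent <- "James Bond"

first_name <- str_sub(agent, start = 1, end = 5)    # characters 1 through 5
last_name  <- str_sub(agent, start = -4, end = -1)  # the last four characters

first_name  # "James"
last_name   # "Bond"
```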
str_replace()
Replaces the first matched pattern.
Useful inside mutate().
Related functions
str_replace_all() replaces all matched patterns
str_remove_all() removes all matched patterns
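The first-match versus all-matches distinction can be seen on a string that contains the pattern twice. The string `intro` is made up for this sketch.

```r
library(stringr)

intro <- "Bond, James Bond"

replace_first <- str_replace(intro, pattern = "Bond", replacement = "Smith")
replace_all   <- str_replace_all(intro, pattern = "Bond", replacement = "Smith")
remove_first  <- str_remove(intro, pattern = "Bond")
remove_all    <- str_remove_all(intro, pattern = "Bond")

replace_first  # "Smith, James Bond"
replace_all    # "Smith, James Smith"
remove_first   # ", James Bond"
remove_all     # ", James "
```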
Convert letters in the string to a specific capitalization format.
str_to_lower() converts all letters in the strings to lowercase.
str_to_upper() converts all letters in the strings to uppercase.
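A quick sketch of stringr's case-conversion functions, `str_to_lower()`, `str_to_upper()`, and `str_to_title()`, on a made-up string:

```r
library(stringr)

agent <- "james BOND"

lower <- str_to_lower(agent)
upper <- str_to_upper(agent)
title <- str_to_title(agent)

lower  # "james bond"
upper  # "JAMES BOND"
title  # "James Bond"
```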
str_c() joins multiple strings into a single string.
prompt <- "Hello, my name is"
first <- "James"
last <- "Bond"
str_c(prompt, last, ",", first, last, sep = " ")
[1] "Hello, my name is Bond , James Bond"
Note
Similar to paste() and paste0()
str_flatten() combines a character vector into a single string.
[1] "Hello, my name is Bond James Bond"
Note
str_c() will do the same thing, but it is encouraged to use str_flatten() instead.
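A sketch of flattening a character vector into one string. The vector `pieces` is an assumption chosen to reproduce the output shown above.

```r
library(stringr)

pieces <- c("Hello,", "my name is", "Bond", "James Bond")

# Collapse the whole vector into one string, separated by spaces
flattened <- str_flatten(pieces, collapse = " ")

# str_c() gives the same result when collapse is supplied
collapsed <- str_c(pieces, collapse = " ")

flattened  # "Hello, my name is Bond James Bond"
```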
str_glue() uses the environment to create a string and evaluates {expressions}.
My name is Bond, James Bond
Tip
See the R package glue!
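A minimal sketch with `str_glue()`, assuming `first` and `last` are defined in the current environment as in the earlier example:

```r
library(stringr)

first <- "James"
last  <- "Bond"

# Anything inside {} is evaluated as R code and pasted into the string
intro <- str_glue("My name is {last}, {first} {last}")
count <- str_glue("{first} has {str_length(first)} letters")

intro  # My name is Bond, James Bond
count  # James has 5 letters
```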
Refer to the stringr cheatsheet
Remember that str_xxx functions need the first argument to be a vector of strings, not a data set.
Use them inside dplyr verbs such as filter() or mutate().
name is_bran manuf type calories protein
1 100% Bran TRUE N cold 70 4
2 100% Natural Bran TRUE Q cold 120 3
3 All-Bran TRUE K cold 70 4
4 All-Bran with Extra Fiber TRUE K cold 50 4
5 Almond Delight FALSE R cold 110 2
6 Apple Cinnamon Cheerios FALSE G cold 110 2
7 Apple Jacks FALSE K cold 110 2
8 Basic 4 FALSE G cold 130 3
9 Bran Chex TRUE R cold 90 2
10 Bran Flakes TRUE P cold 90 3
11 Cap'n'Crunch FALSE Q cold 120 1
12 Cheerios FALSE G cold 110 6
13 Cinnamon Toast Crunch FALSE G cold 120 1
14 Clusters FALSE G cold 110 3
15 Cocoa Puffs FALSE G cold 110 1
16 Corn Chex FALSE R cold 110 2
17 Corn Flakes FALSE K cold 100 2
18 Corn Pops FALSE K cold 110 1
19 Count Chocula FALSE G cold 110 1
20 Cracklin' Oat Bran TRUE K cold 110 3
21 Cream of Wheat (Quick) FALSE N hot 100 3
22 Crispix FALSE K cold 110 2
23 Crispy Wheat & Raisins FALSE G cold 100 2
24 Double Chex FALSE R cold 100 2
25 Froot Loops FALSE K cold 110 2
26 Frosted Flakes FALSE K cold 110 1
27 Frosted Mini-Wheats FALSE K cold 100 3
28 Fruit & Fibre Dates; Walnuts; and Oats FALSE P cold 120 3
29 Fruitful Bran TRUE K cold 120 3
30 Fruity Pebbles FALSE P cold 110 1
31 Golden Crisp FALSE P cold 100 2
32 Golden Grahams FALSE G cold 110 1
33 Grape Nuts Flakes FALSE P cold 100 3
34 Grape-Nuts FALSE P cold 110 3
35 Great Grains Pecan FALSE P cold 120 3
36 Honey Graham Ohs FALSE Q cold 120 1
37 Honey Nut Cheerios FALSE G cold 110 3
38 Honey-comb FALSE P cold 110 1
39 Just Right Crunchy Nuggets FALSE K cold 110 2
40 Just Right Fruit & Nut FALSE K cold 140 3
41 Kix FALSE G cold 110 2
42 Life FALSE Q cold 100 4
43 Lucky Charms FALSE G cold 110 2
44 Maypo FALSE A hot 100 4
45 Muesli Raisins; Dates; & Almonds FALSE R cold 150 4
46 Muesli Raisins; Peaches; & Pecans FALSE R cold 150 4
47 Mueslix Crispy Blend FALSE K cold 160 3
48 Multi-Grain Cheerios FALSE G cold 100 2
49 Nut&Honey Crunch FALSE K cold 120 2
50 Nutri-Grain Almond-Raisin FALSE K cold 140 3
51 Nutri-grain Wheat FALSE K cold 90 3
52 Oatmeal Raisin Crisp FALSE G cold 130 3
53 Post Nat. Raisin Bran TRUE P cold 120 3
54 Product 19 FALSE K cold 100 3
55 Puffed Rice FALSE Q cold 50 1
56 Puffed Wheat FALSE Q cold 50 2
57 Quaker Oat Squares FALSE Q cold 100 4
58 Quaker Oatmeal FALSE Q hot 100 5
59 Raisin Bran TRUE K cold 120 3
60 Raisin Nut Bran TRUE G cold 100 3
61 Raisin Squares FALSE K cold 90 2
62 Rice Chex FALSE R cold 110 1
63 Rice Krispies FALSE K cold 110 2
64 Shredded Wheat FALSE N cold 80 2
65 Shredded Wheat 'n'Bran TRUE N cold 90 3
66 Shredded Wheat spoon size FALSE N cold 90 3
67 Smacks FALSE K cold 110 2
68 Special K FALSE K cold 110 6
69 Strawberry Fruit Wheats FALSE N cold 90 2
70 Total Corn Flakes FALSE G cold 110 2
71 Total Raisin Bran TRUE G cold 140 3
72 Total Whole Grain FALSE G cold 100 3
73 Triples FALSE G cold 110 2
74 Trix FALSE G cold 110 1
75 Wheat Chex FALSE R cold 100 3
76 Wheaties FALSE G cold 100 3
77 Wheaties Honey Gold FALSE G cold 110 2
fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
1 1 130 10.0 5.0 6 280 25 3 1.00 0.33 68.40297
2 5 15 2.0 8.0 8 135 0 3 1.00 1.00 33.98368
3 1 260 9.0 7.0 5 320 25 3 1.00 0.33 59.42551
4 0 140 14.0 8.0 0 330 25 3 1.00 0.50 93.70491
5 2 200 1.0 14.0 8 -1 25 3 1.00 0.75 34.38484
6 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.50954
7 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.17409
8 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.03856
9 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.12025
10 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.31381
11 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.04285
12 2 290 2.0 17.0 1 105 25 1 1.00 1.25 50.76500
13 3 210 0.0 13.0 9 45 25 2 1.00 0.75 19.82357
14 2 140 2.0 13.0 7 105 25 3 1.00 0.50 40.40021
15 1 180 0.0 12.0 13 55 25 2 1.00 1.00 22.73645
16 0 280 0.0 22.0 3 25 25 1 1.00 1.00 41.44502
17 0 290 1.0 21.0 2 35 25 1 1.00 1.00 45.86332
18 0 90 1.0 13.0 12 20 25 2 1.00 1.00 35.78279
19 1 180 0.0 12.0 13 65 25 2 1.00 1.00 22.39651
20 3 140 4.0 10.0 7 160 25 3 1.00 0.50 40.44877
21 0 80 1.0 21.0 0 -1 0 2 1.00 1.00 64.53382
22 0 220 1.0 21.0 3 30 25 3 1.00 1.00 46.89564
23 1 140 2.0 11.0 10 120 25 3 1.00 0.75 36.17620
24 0 190 1.0 18.0 5 80 25 3 1.00 0.75 44.33086
25 1 125 1.0 11.0 13 30 25 2 1.00 1.00 32.20758
26 0 200 1.0 14.0 11 25 25 1 1.00 0.75 31.43597
27 0 0 3.0 14.0 7 100 25 2 1.00 0.80 58.34514
28 2 160 5.0 12.0 10 200 25 3 1.25 0.67 40.91705
29 0 240 5.0 14.0 12 190 25 3 1.33 0.67 41.01549
30 1 135 0.0 13.0 12 25 25 2 1.00 0.75 28.02576
31 0 45 0.0 11.0 15 40 25 1 1.00 0.88 35.25244
32 1 280 0.0 15.0 9 45 25 2 1.00 0.75 23.80404
33 1 140 3.0 15.0 5 85 25 3 1.00 0.88 52.07690
34 0 170 3.0 17.0 3 90 25 3 1.00 0.25 53.37101
35 3 75 3.0 13.0 4 100 25 3 1.00 0.33 45.81172
36 2 220 1.0 12.0 11 45 25 2 1.00 1.00 21.87129
37 1 250 1.5 11.5 10 90 25 1 1.00 0.75 31.07222
38 0 180 0.0 14.0 11 35 25 1 1.00 1.33 28.74241
39 1 170 1.0 17.0 6 60 100 3 1.00 1.00 36.52368
40 1 170 2.0 20.0 9 95 100 3 1.30 0.75 36.47151
41 1 260 0.0 21.0 3 40 25 2 1.00 1.50 39.24111
42 2 150 2.0 12.0 6 95 25 2 1.00 0.67 45.32807
43 1 180 0.0 12.0 12 55 25 2 1.00 1.00 26.73451
44 1 0 0.0 16.0 3 95 25 2 1.00 1.00 54.85092
45 3 95 3.0 16.0 11 170 25 3 1.00 1.00 37.13686
46 3 150 3.0 16.0 11 170 25 3 1.00 1.00 34.13976
47 2 150 3.0 17.0 13 160 25 3 1.50 0.67 30.31335
48 1 220 2.0 15.0 6 90 25 1 1.00 1.00 40.10596
49 1 190 0.0 15.0 9 40 25 2 1.00 0.67 29.92429
50 2 220 3.0 21.0 7 130 25 3 1.33 0.67 40.69232
51 0 170 3.0 18.0 2 90 25 3 1.00 1.00 59.64284
52 2 170 1.5 13.5 10 120 25 3 1.25 0.50 30.45084
53 1 200 6.0 11.0 14 260 25 3 1.33 0.67 37.84059
54 0 320 1.0 20.0 3 45 100 3 1.00 1.00 41.50354
55 0 0 0.0 13.0 0 15 0 3 0.50 1.00 60.75611
56 0 0 1.0 10.0 0 50 0 3 0.50 1.00 63.00565
57 1 135 2.0 14.0 6 110 25 3 1.00 0.50 49.51187
58 2 0 2.7 -1.0 -1 110 0 1 1.00 0.67 50.82839
59 1 210 5.0 14.0 12 240 25 2 1.33 0.75 39.25920
60 2 140 2.5 10.5 8 140 25 3 1.00 0.50 39.70340
61 0 0 2.0 15.0 6 110 25 3 1.00 0.50 55.33314
62 0 240 0.0 23.0 2 30 25 1 1.00 1.13 41.99893
63 0 290 0.0 22.0 3 35 25 1 1.00 1.00 40.56016
64 0 0 3.0 16.0 0 95 0 1 0.83 1.00 68.23588
65 0 0 4.0 19.0 0 140 0 1 1.00 0.67 74.47295
66 0 0 3.0 20.0 0 120 0 1 1.00 0.67 72.80179
67 1 70 1.0 9.0 15 40 25 2 1.00 0.75 31.23005
68 0 230 1.0 16.0 3 55 25 1 1.00 1.00 53.13132
69 0 15 3.0 15.0 5 90 25 2 1.00 1.00 59.36399
70 1 200 0.0 21.0 3 35 100 3 1.00 1.00 38.83975
71 1 190 4.0 15.0 14 230 100 3 1.50 1.00 28.59278
72 1 200 3.0 16.0 3 110 100 3 1.00 1.00 46.65884
73 1 250 0.0 21.0 3 60 25 3 1.00 0.75 39.10617
74 1 140 0.0 13.0 12 25 25 2 1.00 1.00 27.75330
75 1 230 3.0 17.0 3 115 25 1 1.00 0.67 49.78744
76 1 200 3.0 17.0 3 110 25 1 1.00 1.00 51.59219
77 1 200 1.0 16.0 8 60 25 1 1.00 0.75 36.18756
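A column like is_bran in the output above could be built by calling `str_detect()` inside `mutate()`. This is a sketch on a four-row stand-in (`cereal_names`, invented here), not necessarily the exact code used to produce the printout.

```r
library(stringr)
library(dplyr)

# Hypothetical stand-in for the cereal data: only the name column, four rows
cereal_names <- tibble(name = c("All-Bran", "Cheerios",
                                "Cracklin' Oat Bran", "Nutri-grain Wheat"))

# The string function receives the column (a character vector), not the data set
with_flag <- cereal_names |>
  mutate(is_bran = str_detect(name, pattern = "Bran"))

# The new logical column can then drive filter()
bran_only <- with_flag |>
  filter(is_bran)
```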
“Regexps are a very terse language that allow you to describe patterns in strings.”
R for Data Science
R uses “extended” regular expressions, which are common.
Web app to test R regular expressions
Tip
Regular expressions are a reason to use stringr!
You might encounter gsub(), grep(), etc. from Base R.
. ^ $ \ | * + ? { } [ ] ( )
[1] "She" "sells" "seashells" "by" "the" "seashore!"
. Represents any character
[1] "She" "sells" "seashells" "by" "the" "seashore!"
^ Looks at the beginning
$ Looks at the end
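These three metacharacters can be tried on the tongue twister with `str_subset()`. The vector name `x` is an assumption for this sketch.

```r
library(stringr)

x <- c("She", "sells", "seashells", "by", "the", "seashore!")

any_then_ells <- str_subset(x, pattern = ".ells")  # any one character followed by "ells"
starts_with_s <- str_subset(x, pattern = "^s")     # lowercase "s" at the beginning
ends_with_s   <- str_subset(x, pattern = "s$")     # "s" at the very end

any_then_ells  # "sells" "seashells"
starts_with_s  # "sells" "seashells" "seashore!"
ends_with_s    # "sells" "seashells"
```

Note that "She" is not matched by "^s" because regular expressions are case sensitive by default, and "seashore!" is not matched by "s$" because it ends in "!".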
[1] "shes" "shels" "shells" "shellls" "shelllls"
? Occurs 0 or 1 times
+ Occurs 1 or more times
* Occurs 0 or more times
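Each quantifier applies to the character just before it. A sketch on the vector above, with the name `shells` assumed for illustration:

```r
library(stringr)

shells <- c("shes", "shels", "shells", "shellls", "shelllls")

zero_or_one  <- str_subset(shells, pattern = "shel?s")  # "l" zero or one time
one_or_more  <- str_subset(shells, pattern = "shel+s")  # "l" one or more times
zero_or_more <- str_subset(shells, pattern = "shel*s")  # "l" zero or more times

zero_or_one   # "shes" "shels"
one_or_more   # "shels" "shells" "shellls" "shelllls"
zero_or_more  # all five strings
```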
[1] "shes" "shels" "shells" "shellls" "shelllls"
{n} matches exactly n times.
{n,} matches at least n times.
{n,m} matches between n and m times.
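The braces give exact control over the number of repetitions. A sketch on the same strings (the name `shells` is assumed):

```r
library(stringr)

shells <- c("shes", "shels", "shells", "shellls", "shelllls")

exactly_two  <- str_subset(shells, pattern = "shel{2}s")    # exactly two "l"s
at_least_two <- str_subset(shells, pattern = "shel{2,}s")   # two or more "l"s
one_to_three <- str_subset(shells, pattern = "shel{1,3}s")  # between one and three "l"s

exactly_two   # "shells"
at_least_two  # "shells" "shellls" "shelllls"
one_to_three  # "shels" "shells" "shellls"
```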
( )
Groups can be created with ( )
| – “either” / “or”
toung_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
toung_twister2
[1] "Peter" "Piper" "picked" "a" "peck" "of" "pickled"
[8] "peppers!"
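A sketch of grouping and alternation on these words. The vector is redefined here under the illustrative name `twister` so the block runs on its own.

```r
library(stringr)

twister <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")

# With a group: starts with "P" or "p", followed by "e"
grouped <- str_subset(twister, pattern = "^(P|p)e")

# Without a group the | splits the whole pattern: starts with "P", OR contains "pe"
ungrouped <- str_subset(twister, pattern = "^P|pe")

# Ends in "ed" or "er"
endings <- str_subset(twister, pattern = "(ed|er)$")

grouped    # "Peter" "peck" "peppers!"
ungrouped  # "Peter" "Piper" "peck" "peppers!"
endings    # "Peter" "Piper" "picked" "pickled"
```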
[ ]
Character Classes
\w Looks for any “word” character (conversely, \W looks for any “not word” character)
\d Looks for any digit (conversely, \D looks for any “not digit” character)
\s Looks for any whitespace (conversely, \S looks for any “not whitespace” character)
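A sketch of these classes in use. Inside an R string the backslash itself has to be escaped, so `\d` is written as "\\d". The vector `codes` is made up for illustration.

```r
library(stringr)

codes <- c("agent 007", "Q", "M-6", "no_digits")

has_digit    <- str_subset(codes, pattern = "\\d")      # contains a digit
has_space    <- str_subset(codes, pattern = "\\s")      # contains whitespace
has_non_word <- str_subset(codes, pattern = "\\W")      # contains a non-word character
starts_upper <- str_subset(codes, pattern = "^[A-Z]")   # starts with a capital letter

has_digit     # "agent 007" "M-6"
has_space     # "agent 007"
has_non_word  # "agent 007" "M-6" (the underscore counts as a word character)
starts_upper  # "Q" "M-6"
```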
Discuss with a neighbor which regular expressions would search for words that do the following:
Test your answers out on
\
In order to match a special character you need to “escape” it first.
Warning
In general, look at punctuation characters with suspicion.
[1] "How" "much" "wood" "could" "a" "woodchuck"
[7] "chuck" "if" "a" "woodchuck" "could" "chuck"
[13] "wood?"
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)
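The error above comes from using a bare `?` as the pattern. Three ways of matching a literal question mark are sketched below; the shortened vector `chuck` is an assumption for this example.

```r
library(stringr)

chuck <- c("How", "much", "wood", "could", "a", "woodchuck", "chuck", "wood?")

# Escape the ? with a backslash (doubled, because it sits inside an R string)
escaped <- str_subset(chuck, pattern = "\\?")

# Put the ? inside a character class, where it loses its special meaning
in_class <- str_subset(chuck, pattern = "[?]")

# Tell stringr to treat the pattern as plain text rather than a regular expression
as_fixed <- str_subset(chuck, pattern = fixed("?"))

escaped  # "wood?"
```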
I use the stringr cheatsheet more than any other package cheatsheet!
Be kind to yourself when working with regular expressions!
Read the regular expressions out loud like a “request”
tidyverse
stringr functions + dplyr verbs!
Find countries that start with an “A”:
| Country |
|---|
| Africa |
| Algeria |
| Angola |
| Americas |
| Argentina |
| Asia & Oceania |
| Afghanistan |
| Australia |
| Albania |
| Armenia |
| Azerbaijan |
| Austria |
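One way a result like the table above could be produced is `str_detect()` inside `filter()`. The tibble `places` is a small invented stand-in, since the original data set is not shown.

```r
library(stringr)
library(dplyr)

# Hypothetical stand-in for the original data
places <- tibble(Country = c("Africa", "Brazil", "Asia & Oceania", "Canada", "Austria"))

# ^A anchors the match to the start of the string
a_places <- places |>
  filter(str_detect(Country, pattern = "^A"))

a_places$Country  # "Africa" "Asia & Oceania" "Austria"
```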
matches(pattern)
Selects all variables with a name that matches the supplied pattern.
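A sketch of `matches()` as a column selector. The `survey` tibble and its column names are invented for illustration.

```r
library(dplyr)

# Hypothetical data with a naming scheme that a regular expression can describe
survey <- tibble(id = 1:2,
                 q1_score = c(3, 5),
                 q2_score = c(4, 2),
                 notes = c("a", "b"))

# Keep only the columns whose names look like q<number>_score
scores <- survey |>
  select(matches("^q[0-9]+_score$"))

# Rename only the matching columns
shouted <- survey |>
  rename_with(toupper, .cols = matches("score"))

names(scores)   # "q1_score" "q2_score"
names(shouted)  # "id" "Q1_SCORE" "Q2_SCORE" "notes"
```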
Useful inside select(), rename_with(), and across().
I received this data from a grad school colleague the other day who asked if I knew how to “clean” it.
What is that column?!
[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21APR-02 (Delta B.1.617.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21OCT-01 (Delta AY 4.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-22DEC-01 (Omicron CH.1.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 24.56}, {'variant': 'V-22JUL-01 (Omicron BA.2.75)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 8.93}, {'variant': 'V-22OCT-01 (Omicron BQ.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 49.57}, {'variant': 'VOC-21NOV-01 (Omicron BA.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.02}, {'variant': 'VOC-22APR-03 (Omicron BA.4)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.08}, {'variant': 'VOC-22APR-04 (Omicron BA.5)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.59}, {'variant': 'VOC-22JAN-01 (Omicron BA.2)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 1.41}, {'variant': 'unclassified_variant', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.26}]
stringr!
Let’s see how this works.
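One possible approach, not necessarily the one demonstrated in class, is to capture the pieces of interest with `str_match_all()`. The string `raw` below is a shortened copy of the column value, and the object names are assumptions for this sketch.

```r
library(stringr)

# Shortened copy of the messy value: the first two entries only
raw <- "[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}]"

# Column 1 of each result holds the full match, column 2 the captured group
variant_match <- str_match_all(raw, pattern = "'variant': '([^']+)'")[[1]]
percent_match <- str_match_all(raw, pattern = "'newWeeklyPercentage': ([0-9.]+)")[[1]]

variants    <- variant_match[, 2]
percentages <- as.numeric(percent_match[, 2])

variants     # "Other" "V-20DEC-01 (Alpha)"
percentages  # 4.59 0.00
```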
In this activity, you will be using regular expressions to decode a message.
Remember, the stringr functions go inside dplyr verbs like mutate() and filter(). Think of them as you would as.factor()